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Abstract 

Multiple testing is often carried out in practice using approximated p-values obtained, for 
instance, via bootstrap or permutation tests. We are interested in allocating a pre-specified 
total number of samples (that is draws from a bootstrap distribution or permutations) to 
all hypotheses in order to approximate their p-values in an optimal way, in the sense that 
the allocation minimizes the total expected number of misclassified hypotheses. By a mis- 
classified hypothesis we refer to a decision on single hypotheses which differs from the one 
obtained if all p-values were known analytically. Neither using a constant number of samples 
per p-value estimate nor more sophisticated approaches available in the literature guarantee 
the computation of an optimal allocation in the above sense. This article derives the optimal 
allocation of a finite total number of samples to a finite number of hypotheses tested using 
the Bonferroni correction. Simulation studies show that a simple sampling algorithm based 
on Thompson Sampling asympotically mimics this optimal allocation. 


1 Introduction 

Scientific studies are often evaluated by correcting for multiple comparisons, for instance using 


the Bonferroni (1936) correction or the procedure of Benjamini and Hochberg (1995). 


Although testing procedures such as the Bonferroni (1936) correction require exact knowl¬ 


edge of the p-value underlying each statistical test, the ideal p-values of the tests are usually not 
available analytically in practice and thus have to be approximated using Monte-Carlo meth¬ 
ods, for instance via bootstrap or permutation tests. By the ideal p-value we refer to the one 
obtained by integrating over the theoretical bootstrap distribution (in case of bootstrap tests) 
or by generating all permutations (in case of permutation tests). This scenario is common in 


scientific studies involving real data (Boca et al 

2014 Liu et al., 

2012 

Marschall et al., 

2012 

Brinza et al., 

2010; 

jin 

2005). 


2013 Chen and Ishwaran 


If the overall number of samples one is able to spend on the approximation of all p-values is 
limited as is the case in practice (for instance due to time constraints), increasing the approx¬ 
imation accuracy of certain p-value estimates comes at the cost of decreasing the accuracy of 
others. 

We are interested in deriving the optimal allocation of a finite number of samples to a finite 
number of hypotheses tested using the Bonferroni (1936) correction. By optimal we refer to 


the allocation which minimizes the total expected number of misclassified hypotheses, that is 
the expected number of hypotheses whose classification (rejected, non-rejected) differs from the 
one obtained with the ideal p-values. Recent examples for studies evaluated by applying the 
Bonferroni (1936) correction to approximated p-values include Thulin (2014), Zhang (2008) and 
Kim et al. (2007). 
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Several methods to compute significant and non-significant hypotheses via approximated 
p-values are available in the literature. For instance the method of Besag and Clifford (1991), 
the approaches by |Lin| (|2005|) , Ivan Wieringen et al.l (|2008l) , I Guo and Peddada (2008), the MCFDR 


algorithm of Sandve et al. (2011) or the MMCTest algorithm of Gandy and Hahn (2014). However, 
it is unclear how the allocation of samples computed by the algorithms aforementioned compares 
to the optimal allocation. 

This article is organized as follows. In Section [2] we will first state a mathematical formu¬ 
lation of the problem under investigation (Section |2.1[ ). We then derive the optimal allocation 
minimizing the overall expected number of misclassifications of hypotheses under the assumption 
that the number of samples to be allocated is continous (Section 2.2). This is achieved by solving 
a suitable optimization problem using the Kuhn and Tucker (1951) formalism (Section |2.3[ ). As 
in practice any allocation is discrete, the related discrete optimization problem is heuristcally 
solved to indicate that both the continous and discrete optimal solutions do not differ by much 
(Appendix [AJ . 

In Section [3] we demonstrate empirically that a simple sampling algorithm based on Thomp¬ 
son Sampling ([Thompson 


1933; 

Agrawal and Goyal, 2012 

) introduced in 

Gandy and Hahn 


(2015) asympotically mimics the optimal allocation of samples. 
The article concludes with a discussion in Section [H 


2 Deriving the optimal allocation for a normal approximation 


The following section introduces the problem under investigation in this article and presents a 
mathematical formulation as an optimization problem. The optimal allocation minimizing the 


expected number of misclassifications is derived with the help of a Kuhn and Tucker (1951) 
condition. 


2.1 Formulation of the problem 

Suppose we are interested in testing m hypotheses Hoi, • ■ •, Ho m for statistical significance using 


the Bonferroni (1936) correction at a fixed threshold a £ (0,1). The ideal but unknown p-value 
underlying each hypothesis Hq* is denoted by pi, i £ {1,..., m}. 


The Bonferroni (1936) correction returning the indices of rejected hypotheses is defined as 
B(p,a ) = {i : Pi < a}, where p = {pi, ■ ■ ■ ,p m ) is a vector of m p-values. The threshold a 
typically is a fixed significance level a* divided by the number of hypotheses m to correct for 
multiple tests, that is a = a*/m. 

As the ideal p-values are unknown, we assume that Monte-Carlo methods are used to approx¬ 
imate them via pi = Si/ki , where ki is the total number of random samples drawn for H q* and 
Si of these were significant. The significant samples can be modeled as Si ~ Binomial^,p^). 

We assume that in practical applications where the ideal p-values are unknown, Ho* is 
rejected if Si/ki < a, that is if its p-value estimate p* is below the constant threshold a. All 
remaining hypotheses are non-rejected. This strategy is employed in all the real data studies 
mentioned in Section [TJ 

We are interested in finding the allocation of samples k\,... ,k m to the hypotheses 
Hoi,..., Ho m which minimizes the expected number of misclassifications, subject to the con¬ 
straint that y^™ii ki = K for a total number of samples K £ N specified in advance by the 
user. 
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Let Mi = {Si/ki > a A pi < a} U {Si/ki < a A pi > a} be the event that hypothesis 
is misclassified. For the Bonferroni (1936) correction under consideration, this event occurs if 
the p-value estimate pi = Si/ki and the ideal p t for H o* are on the two different sides of the 
threshold a. Using the event Mi, the total number of misclassifications can be expressed as 
M = YaLi Iwhere II is the indicator function. When allocating ki samples to hypothesis H o* 
to estimate its p-value, the expected total number of false classifications is thus 


g(k) = E(M\k) = J2 ¥ (Mi\ki) 

2—1 

m 

= ^2 l^( s i/ k i > a \Pi ) ■ KPi < «) + nSi/h < a\p.i ) • I (pi > a)] , 
2 — 1 


where k = (k \,..., k rn ) and S t ~ Binomial(fcj,pi). The function g goes to zero as minj = i v „ jm ki -A 
oo. This is to be expected as by the Central Limit Theorem, the estimates converge to the ideal 
p-values as more samples are drawn, hence the probability for a wrong classification decreases. 


2.2 Deriving the optimal allocation under a normal approximation 

We determine the vector k minimizing g under the condition that = K which is imposed 

via a Kuhn-Tucker constraint (Kuhn and Tucker, 1951). As derivatives are needed for the Kuhn- 
Tucker formalism, the binomial distribution in g was replaced by its normal approximation: 


g{k) 


2 — 1 _ 


1 - $ 


I kj(a-pi) 
\^kiPi(l - pi) 


+ $ 


/ kj(a-p/) 
W kiPi(l - pi) 


l(pi > a) 


I(pi < a) 

=: h{k), 


where is the cumulative distribution function of the standard normal distribution. The deriva¬ 
tive of h is given by 


dg 

dki 


, B " a J fe(a - ft) V I(p, < a) 

dh 2^JkiPi(\ - p/) \^/kiPi(l - pi)) 


+ - 


Pi- a \ , / kj(a - pi) 


2^/kiPi{l - pi) I \^kiPi{l - pi) 


l{pi > a) 


where (j) is the density function of the standard normal distribution. The partial derivative 
dh/dki depends on ki only. 

The function h needs to be optimised under the constraints k > 0 and Y'i=i Y = K, meaning 
that each hypothesis receives at least one sample and that the total number of samples allocated 
equals K. The optimal solution k* minimizing h satisfies 


m 

Vh(k*) = ^2piVui{k*) + AVu(F), 
2=1 


(1) 


where Ui(k) = —ki and v{k) = K — Yi =i Y (Kuhn and Tucker 1951). The functions Ui encode 


the constraints ki > 0 (primal feasibility) with m > 0 (dual feasibility) and satisfy g,iUi{k*) = 0 
(complementary slackness), where i G {1,..., m .}. 
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Complementary slackness and the condition ki > 0 imply that Pi = 0 for all i. As each partial 
derivative dh/dki only depends on ki, the Kuhn-Tucker condition simplifies to dh/dki = —A* 
for i € {1,... ,m} and a A* > 0. 


2.3 Solving for the optimal allocation 

This section looks at useful computational considerations for solving The optimal allocation 
for a normal approximation is obtained by tuning the parameter A towards an optimal value A* 
with the property that the corresponding optimal vector k* = (k*...., k solving Q satisfies 
0 = v(k*). 

The optimal A* can be found using a binary search as follows. Similarly to g, the function h is 
positive and monotonically decreases to zero as mirij = i m ki —> oo. The derivatives dh/dki are 
therefore negative for ki > 0 and monotonically increase to zero as ki —> oo. As a consequence, 
when solving dh/dki = — A for ki, smaller values of A correspond to larger values of ki and vice 
versa. 

This observation applies to all ki, i € {1,..., m}. Thus, for the solution A; of 0 

also monotonically decreases as A —> oo. The optimal value can therefore be found easily with a 
binary search. Likewise, as h is monotonic, the solution ki of dh/dki = — A can be found using 
a binary search in ki for each proposed value of A. 


3 A sampling algorithm based on Thompson Sampling opti¬ 
mally draws samples 

3.1 The QuickMMCTest algorithm 

In this section, a runtime study is conducted to empirically demonstrate that the allocation of 


samples to multiple hypotheses computed by the QuickMMCTest algorithm of Gandy and Hahn 


(2015) asympotically behaves like the optimal allocation. 


QuickMMCTest is used to iteratively allocate samples to all m. hypotheses. The algorithm 
is designed in such a way as to allocate more samples adaptively to those p-values which are 
closer to the testing threshold and thus harder to classify. Although QuickMMCTest is capable, 
in principle, to compute allocations for tests evaluated with arbitrary step-up and step-down 


procedures, we use it in connection with the Bonferroni (1936) correction B(p,a) only (Section 


2.1). Moreover, QuickMMCTest depends on two parameters: the total number of samples K to 
be spent and the total number of iterations it. The choice of K and it is given individually in 
each of the following subsections. 


3.2 Empirical comparison to the optimal allocation 

In the present section, the allocation of samples computed by QuickMMCTest will be compared 
to the continous optimal allocation derived in Section [2j Although we use the continous optimal 
allocation as a reference for the comparison, we show in Appendix [0 that solving the discrete 
optimisation problem using a greedy approach indicates that the discrete and continous optimal 
solutions are most likely similar. 

For all comparisons in this section, we use a mixture distribution introduced in Sandvej 
). It consists of a proportion ttq of the m hypotheses drawn from a Uniform[0,1] 
distribution and the remaining proportion 1 — 7 Tq drawn from a Beta(0.25, 25) distribution. We 


et al. (2011 


4 














Figure 1: Continuous optimal solution (solid line) computed via Kuhn-Tucker and allocation of 
samples returned by QuickMMCTest (crosses). 


fix a realisation generated from this mixture distribution with m = 500 p-values and parameter 
ttq = 0.5. The p-values were then sorted in order to be able to plot the continous optimal 
allocation as a continous curve and hence increase clarity in plots. We refer to the fixed p-values 
as pi, i G {1,..., m}. 

We compute the optimal continous allocation for our fixed distribution as described in Section 
2.3 Additionally, we apply QuickMMCTest to the fixed p-values and record its allocation of 
samples to each hypothesis. Testing was carried out at a Bonferroni (1936) threshold of a = 
0.1/m. 

Figure [I] shows the continous optimal solution (solid curve) as well as the number of samples 
allocated to each hypothesis by QuickMMCTest (crosses). The total number of samples was 
K = 10 6 . The cutoff for multiple testing occurred at the 80th p-value. 

As visible in the plot, p-values far away from the cutoff point and thus far away from the 
threshold are not allocated any samples. On approaching the cutoff, more samples are needed 
to accurately classify p-values closer to the threshold as lying in the rejection or non-rejection 
area. However, p-values which are too close to the threshold on either side are not worth 
being spent too much effort on as deciding them is out of scope for a limited effort (as shown 
in Wald (1945), deciding if a single p-value is above or below a constant threshold takes an 


expected infinite number of samples under suitable assumptions on the distribution of p-values). 
Therefore, the continous optimal allocation decreases again on either side of the intersection 
point, yielding a bimodal allocation. 

The allocation computed by QuickMMCTest roughly resembles the shape of the optimal allo¬ 
cation. QuickMMCTest allocates no sample or just one sample to hypotheses with p-values which 
are considerably above the threshold. Moreover, QuickMMCTest allocates samples to hypotheses 
with p-values above the threshold in such a way that its allocation roughly approximates the 
second mode of the continous allocation. However, the first mode of the continous allocation is 
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theoretical misclassifications 

empirical misclassifications 

optimal 

9.5 

4.5 

QuickMMCTest 

22.8 

7.2 


Table 1: Average empirically observed misclassifications and theoretical misclassifications for 
the optimal allocation and the one returned by QuickMMCTest. 


poorly approximated as QuickMMCTest seems to give an equal number of samples (according to 
Figure [1} about 10000 samples) to each hypothesis below the cutoff. 

The allocation returned by QuickMMCTest and the continous optimal allocation are evaluated 
in two more ways. First, both allocations can be substituted into the function h to obtain a 
theoretical number of misclassifications expected for each allocation. 

Moreover, one can draw a constant number of samples for each hypothesis based on the 
allocation vector returned by both the optimal allocation and QuickMMCTest. Based on Si 
exceedances observed among ki samples drawn for each hypothesis Hoi, i E {1 , we 

estimate its p-value as pi = (Si + 1 )/(&« + 1) and classify each hypothesis based on the estimate. 
Comparing the estimated classification to the one obtained with the known p-values allows to 
compute empirical numbers of misclassified hypotheses. This experiment is carried out a total of 
r repetitions for both the continous optimal allocation and the one returned by QuickMMCTest. 

Table [l] shows averages for r = 10 repetitions. First, the theoretical number of misclassifi¬ 
cations (obtained by evaluating the function h on each allocation) is considerably lower for the 
optimal allocation than for QuickMMCTest. This is natural as QuickMMCTest was never designed 
to minimize the function h whereas the optimal allocation is the minimum of h by definition. 
More illustrative is therefore the number of empirical misclassifications: As shown in Table [I] 
QuickMMCTest performs considerably better in terms of empirical misclassifications and almost 
reaches the limit set by the optimal allocation. 

3.3 Asympotic behavior of the allocation of samples 

Finally, we investigate if the behavior of QuickMMCTest mimics the optimal allocation as the 
total number K of samples to be spent increases. This is done using the fixed set of m = 5000 
p-values for a varying total number of samples K which was increased from 10 5 to 10' in 100 
steps. We calculate 5%, 50% and 95% quantiles of the empirical number of correctly classified 
hypotheses for QuickMMCTest (based on r = 10 repetitions) divided by the theoretical number 
of correctly classified hypotheses (obtained by evaluating h on the optimal allocation derived in 
Section |2.2[ ) . 

Figure [2] shows that the ratio of correct classifications of QuickMMCTest to the ones of 
the optimal allocation seems to converge to one as K —> oo. This gives additional confir¬ 
mation to the observation made for Figure [l] which suggested that the allocation of samples 
of QuickMMCTest seems to resemble the continous optimal one. More precisely, the allocation 
returned by QuickMMCTest seems to behave in such a way as mimic the optimal one in terms of 
correctly classified hypotheses. 
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Figure 2: Convergence of the Thompson solution to the optimal solution. 5%, 50% and 95% 
quantiles of the number of correctly classified hypotheses for QuickMMCTest divided by the 
theoretical number of correctly classified hypotheses. 


4 Discussion 

Many algorithms published in the literature are designed to evaluate multiple hypotheses using 
Monte-Carlo simulations to compensate for unknown p-values corresponding to the hypotheses 
under consideration. Though sensible, however, it is unclear if these algorithms distribute the 
number of samples they draw in an optimal way in the sense that the allocation of samples 
minimizes the number of misclassihed hypotheses. 

The present article is concerned with the allocation of samples that minimizes the expected 
number of misclassihed hypotheses. We derive the optimal allocation of a finite number of 
samples using a Kuhn-Tucker condition under the assumption that the p-values are known and 
that the number of samples is continous. 

We empirically establish that a simple sampling algorithm based on Thompson sampling 
performs asympotically optimally in the sense that its allocation of samples to each hypothesis 
behaves like the optimal allocation. In contrast to the optimal allocation derived using known 
p-values, our algorithm does not need to have knowledge of any p-value as long as it is possible to 
obtain samples under each null hypothesis. Moreover, we show that the algorithm presented in 
the article mimics the optimal allocation for a finite number of samples. QuickMMCTest therefore 
seems to be a suitable choice if near optimal performance is desired. 
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A A greedy solution to the discrete version of the optimal allo¬ 
cation problem 


We investigate one of many possible greedy algorithms which tries to compute a discrete optimal 
allocation for a given set of p-values. The aim of this section is to argue that, although unknown, 
the discrete optimal allocation is most likely of a similar form as the continous optimal solution 
which was derived via a Kuhn and Tucker (1951) condition in Section [2] 

For notational convenience, we express the function g introduced in Section ! as g(k) = 
XX l 9i(h) for k = (fci,..., k m ), where 


gi{h) = F(Mi\ki) = P {Si/h > a\pf) • I (pi < a) + P {Si/ki < a\pf) ■ I (pi > a) 


where M % = {Si/ki > a A pi < a} U {Si/ki <aApj > a} and Si ~ Binomial(/c,;,X (see Section 
[2]). We consider the following greedy approach to computing a discrete optimal allocation for a 
given vector of p-values p = {pi,... ,p m )- 

Apart from the p-values, the greedy algorithm also requires the testing threshold a and the 
total number K of samples to be allocated to the m hypotheses under consideration. In what 
follows, II denotes the indicator function. 


Algorithm 1: Greedy 


input: p, a, K 

1 jump A- 1/a; j 1 ; hi <- 0, k t A- I(pi < a), i E {1, - - - ,m}; 

2 while 1 hi + bj < K do 

3 

4 


k j i — kj T bj ; 

for i •<— 1 to m do 


minjz E N : gi(h + z) < gi(ki)} 

Pi > a, 


jump — 1 

Pi < Oi, 

ki < jump, 

jump 

Pi < Oi, 

ki > jump. 


di A- gi(ki + h) - gi(ki)‘, 
j A- argmaxj = i j ... jm di/bp, 


8 return (ki,... ,k r 


In order to highlight the reasoning behind Algorithm [lj consider the behavior of the function 
gt for a p-value pi below and above the threshold a as illustrated exemplarily in Figure [3j We 
used m = 500 p-values and threshold a = 0.1/m = 1/5000. Figure [3] displays the expected 
number of misclassifications, obtained by evaluating the function h, against number of samples 
for a hypothesis below (left) and above (right) the threshold, where we defined that drawing 0 
samples yields a p-value of 0 and hence a correct rejection if the hypothesis corresponding to 
that p-value is rejected. 
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Figure 3: Example of the expected number of misclassifications (obtained by evaluating the 
function g/) against number of samples for a hypothesis below (left) and above (right) the 
threshold. 


As shown in Figure [lj the behavior of gt exhibits a particular structure. A rejection is 
obtained if a p-value estimate is below the threshold at a = 1/5000. Thus for a p-value below a 
(left plot), when using less than 5000 samples, only in the case when 0 exceedances are observed, 
a p-value will be classified as rejected. Drawing more samples hence only potentially increases 
the probability of making a false decision. When reaching 1/a = 5000 samples, both 0 or 1 
exceedances will lead to a rejection and hence to a correct classification. The expected number 
of misclassifications drops. 

For a p-value above a (right plot), the inverse effect happens. Drawing no samples leads 
to a rejection (a p-value of 0) and thus to a sure misclassification. Drawing more samples 
decreases the expected misclassifications as observing 0 exceedances out of k t samples (the only 
case in which the hypothesis will be rejected and thus misclassified) becomes more unlikely as ki 
approaches 1/a = 5000. When reaching 1/a = 5000 samples, both 0 and 1 exceedances lead to 
a rejection and thus to a misclassification. The expected number of misclassifications increases 
again. 

As expected, the probability for a misclassification of an hypothesis H o* vanishes in both 
cases as ki —> oo. 

Algorithm [I] works in the following way. We greedily try to fill up a zero vector k storing the 
final allocation of samples in such a way as to decrease the function g in every step, subject to 
the constraint YITLi ki = K. In each iteration, we determine a sensible number of bi samples to 
be allocated to hypothesis i. This number varies for each hypothesis: for hypotheses above the 
threshold, bi is determined as the smallest increase that leads to a decrease in g. For hypotheses 
below the threshold, only a jump of 1/a — 1 is sensible if ki = 1 or a jump of 1/a in the case 
ki = 0: that is, we jump to the first point of the rising branches in the left plot of Figure |3j 

Certain jumps might lead to a big decrease in the value of the function g, but only at the 
cost of also spending (too) many samples on just one hypothesis. We therefore determine the 
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Figure 4: Greedy solution of Algorithm [I] (bold) compared to the continous optimal solution 
(dotted) computed as described in Section 2.3 Log scale on the y-axis. 


decrease di for each i and allocate the batch of samples only to the one hypothesis which has 
the best ratio dj/6*. 

Algorithm [l] is just one of many possible greedy approaches to compute a sensible discrete al¬ 
location. We tried various variants and allocation rules, all having similar behavior as Algorithm 
[l]and resulting in similar allocations. 

Figure [4] compares two allocations, a greedy discrete one computed by Algorithm [l and the 
continous optimal one computed via Kuhn and Tucker (1951) as described in Section 2.3 We 


used a p-value distribution with m = 500 hypotheses generated from the mixture distribution 
introduced in Section 3.2 with parameter ttq = 0.5 and carried out testing at a constant (uncor¬ 
rected, that is not divided by m) threshold of 0.1. Figure [4] shows that the two allocations are 
qualitatively similar, in particular for the allocation of samples to hypotheses having p-values 
above the threshold. For p-values below the threshold, Algorithm [l] allocates more samples than 
the optimal allocation. This discrepancy increases as the p-values tend to zero. However, as seen 
in the plot, the optimal allocation distributes fractions of a sample to low p-values. These would, 
for instance, be rounded up to 1 sample per hypothesis, thus making the greedy discrete and 
the optimal continous solutions coincide. Strikingly, the greedy solution does not allocate more 
samples than the pre-set 1 initial sample to rejected hypotheses close to the threshold, whereas 
the continous Kuhn and Tucker (1951) solution allocates samples roughly symmetrically around 
the threshold. 


11 




















